Ashot Vardanian


Design 


Nice to meet you! 


I am Ash


+ 2003 - 15
  + Olympiads, Web, iOS, macOS
  + Astrophysics & Scientific Computing
+ 2015 -
  + Started Unum to build the largest intelligent systems.
  + Worked on Neural Nets, Graphs, Analytics, Compression, Encryption...


@ashvardanian 


I was working on nanosecond optimizations


When I faced bottlenecks in storage: Postgres, MongoDB, Neo4J...


Intel Intrinsics, CUDA Intrinsics, LLVM Intrinsics, GCC Intrinsics


x = _mm256_and_ps(x, (__m256)_mm256_set1_epi32(~0x7f800000));
x = _mm256_or_ps(x, _mm256_set1_ps(0.5f));
imm0 = _mm256_sub_epi32(imm0, _mm256_set1_epi32(0x7f));
__m256 e = _mm256_cvtepi32_ps(imm0);
e = _mm256_add_ps(e, one);
__m256 mask = _mm256_cmp_ps(x, _mm256_set1_ps(0.707106781186547524), _CMP_LT_OS);
__m256 tmp = _mm256_and_ps(x, mask);
x = _mm256_sub_ps(x, one);
e = _mm256_sub_ps(e, _mm256_and_ps(one, mask));
x = _mm256_add_ps(x, tmp);
__m256 z = _mm256_mul_ps(x, x);

x = _mm256_max_ps(x, _mm256_set1_ps(-88.3762626647949f));
fx = _mm256_mul_ps(x, _mm256_set1_ps(1.44269504088896341));
fx = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));
tmp = _mm256_floor_ps(fx);


No shortage of alternative databases 


[Table: database funding by company, with columns Raised in 2021, Total Raised, Valuation, Total Rounds; rows include CockroachDB, Yugabyte, Redis, TigerGraph]


unum.cloud: DBMS Gold Rush of 2021 


Most new Databases grow on Rocks


LevelDB + Transactions + LSM Tree 


+ Facebook: MyRocks = MySQL on RocksDB
+ Twitter: Manhattan distributed store on RocksDB
+ Yahoo: Sherpa distributed store on RocksDB


+ CockroachDB = Distributed Postgres on RocksDB
+ Yugabyte = Distributed Postgres on RocksDB


+ Apache Samza, Kafka, ...


Diving Into RocksDB 


Felt wrong after SIMD 


virtual inline Status Get(const ReadOptions& options,
                          ColumnFamilyHandle* column_family, const Slice& key,
                          std::string* value) {
...
virtual void MultiGet(const ReadOptions& options,
                      ColumnFamilyHandle* column_family,
                      const size_t num_keys, const Slice* keys,
                      PinnableSlice* values, Status* statuses,
                      const bool /*sorted_input*/ = false) {
  std::vector<ColumnFamilyHandle*> cf;
  std::vector<Slice> user_keys;
  std::vector<Status> status;
  std::vector<std::string> vals;


+ STL containers
+ Global allocators
+ Excessive allocations


rocksdb/include/rocksdb/db.h 


Same Story with File Structure 
BlockBasedTable Format isn't NVMe-Friendly 


<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]                  (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]          (see section: "range deletion" Meta Block)
[meta block 5: stats block]                   (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                               (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>


Too many functional blocks compensating for poor design choices


rocksdb/wiki/Rocksdb-BlockBasedTable-Format 
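

To make the "[Footer] starts at file_size - sizeof(Footer)" rule concrete, here is a minimal sketch in Python. The footer layout (three little-endian 64-bit integers) and the magic number are made up for illustration and are not the real RocksDB format. What it shows is the access pattern: a reader has to start at the end of the file, and the next read cannot be issued until the previous one has returned.

```python
import os
import struct

# Hypothetical footer, NOT the RocksDB one: metaindex offset, metaindex size, magic.
FOOTER = struct.Struct("<QQQ")
MAGIC = 0x1234567890ABCDEF  # made-up value for this sketch


def write_table(path, data_blocks, metaindex):
    """Write data blocks, then a metaindex block, then the fixed-size footer."""
    with open(path, "wb") as f:
        for block in data_blocks:
            f.write(block)
        metaindex_offset = f.tell()
        f.write(metaindex)
        f.write(FOOTER.pack(metaindex_offset, len(metaindex), MAGIC))


def read_metaindex(path):
    """Locate the metaindex block with two dependent reads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        file_size = os.fstat(fd).st_size
        # Read 1: the footer sits at file_size - sizeof(Footer).
        raw = os.pread(fd, FOOTER.size, file_size - FOOTER.size)
        offset, size, magic = FOOTER.unpack(raw)
        if magic != MAGIC:
            raise ValueError("unexpected magic number")
        # Read 2: can only be issued once read 1 has completed.
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)
```

In a layout with more meta blocks, the chain of dependent reads grows before the first data block can be fetched.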
High-Cost Abstractions


Over io_uring and liburing


struct WrappedReadRequest {
  FSReadRequest* req;
  struct iovec iov;
  size_t finished_len;
  explicit WrappedReadRequest(FSReadRequest* r) : req(r), finished_len(0) {}
};

autovector<WrappedReadRequest, 32> req_wraps;
autovector<WrappedReadRequest*, 4> incomplete_rq_list;
std::unordered_set<WrappedReadRequest*> wrap_cache;

for (size_t i = 0; i < num_reqs; i++) {
  req_wraps.emplace_back(&reqs[i]);
}

size_t reqs_off = 0;
while (num_reqs > reqs_off || !incomplete_rq_list.empty()) {
  size_t this_reqs = (num_reqs - reqs_off) + incomplete_rq_list.size();

  // If requests exceed depth, split it into batches
  if (this_reqs > kIoUringDepth) this_reqs = kIoUringDepth;


Wrapping requests with metadata negates the benefits of deep queues with heap-allocated vectors and complex sync logic


rocksdb/env/io_posix.cc
Ext4 Filesystem Example


static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    ssize_t ret;
    struct inode *inode = file_inode(iocb->ki_filp);

    if (iocb->ki_flags & IOCB_NOWAIT) {
        if (!inode_trylock_shared(inode))
            return -EAGAIN;
    } else {
        inode_lock_shared(inode);
    }

    if (!ext4_should_use_dio(iocb, to)) {
        inode_unlock_shared(inode);
        /*
         * Fallback to buffered I/O if the operation being performed on
         * the inode is not supported by direct I/O. The IOCB_DIRECT
         * flag needs to be cleared here in order to ensure that the
         * direct I/O path within generic_file_read_iter() is not
         * taken.
         */
        iocb->ki_flags &= ~IOCB_DIRECT;
        return generic_file_read_iter(iocb, to);
    }

    ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, NULL, 0);
    inode_unlock_shared(inode);

    file_accessed(iocb->ki_filp);
    return ret;
}


[Diagram: Application, Userspace Buffers, mmap, Page Cache, Block I/O Layer, Device Layer, SPDK]


Most modern IO goes through more layers than presented on the diagram, locking mutexes everywhere.


linux/fs/ext4/file.c


Modern Key-Value Stores at a Glance


Have three parts


+ Concurrent Mem-Table: allocator-dependent Skip-List
+ Versioning & Garbage Collection: slow compactions
+ IO Logic: synchronous, interrupting, or poor async


Topic of Today 


Which are the IO options? 


In order of maturity 


+ UNIX IO system calls
+ POSIX AIO since Linux kernel 2.5 ~ 2002
+ io_uring since Linux kernel 5.1 ~ 2019
+ Magnum IO for Nvidia GPUs, including GPUDirect Storage
+ SPDK on Linux


Sub-Topic of Today 
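

As a baseline for the first option in the list, here is a minimal sketch of blocking positional reads in Python, where os.pread wraps the UNIX pread system call. The 4096-byte block size and the helper names are assumptions for the illustration. Every request is its own system call, and the caller waits for each one before issuing the next.

```python
import os
import tempfile

BLOCK = 4096  # assumed block size for this sketch


def read_blocks(path, block_ids, block_size=BLOCK):
    """Fetch blocks one by one; each os.pread blocks until its data is ready."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return [os.pread(fd, block_size, i * block_size) for i in block_ids]
    finally:
        os.close(fd)


def demo():
    """Write 8 blocks, each filled with its own index, then read three of them."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(8):
            f.write(bytes([i]) * BLOCK)
        path = f.name
    try:
        return read_blocks(path, [5, 0, 3])
    finally:
        os.unlink(path)
```

The asynchronous interfaces further down the list differ mainly in letting many such requests be in flight before any result is collected.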


Intel & Micron announced 3D XPoint in 2015


But they had no IO stack ready for 5 µs devices


Build Ultra High-Performance 
Storage Applications with the 
Storage Performance Development 
Kit 

The Storage Performance Development Kit (SPDK) provides a set of 


tools and libraries for writing high performance, scalable, user-mode 
storage applications. 


Get started 


spdk.io 


SPDK Hello World 


6 steps, 500 lines of code


1. Root privileges
2. Probe for NVMe controllers
3. Create multiple non-thread-safe IO queues per controller
4. Allocate page-aligned buffers with pinned addresses
5. Submit requests
6. Poll for completion


spdk/examples/nvme/hello_world/hello_world.c 
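

Steps 5 and 6 describe a submit-then-poll model rather than a blocking call. The toy below is a pure-Python stand-in and not SPDK: ToyQueuePair, its methods, and the in-memory "device" are all hypothetical names invented for this sketch. It only illustrates the control flow, where submission returns immediately and callbacks fire when the owner of the queue polls it.

```python
from collections import deque


class ToyQueuePair:
    """In-memory stand-in for one IO queue; meant to be used by a single thread."""

    def __init__(self, device, depth=32):
        self.device = device      # bytes object playing the role of the drive
        self.depth = depth        # maximum number of requests in flight
        self.inflight = deque()

    def submit_read(self, offset, length, callback):
        """Queue a read and return at once; False means the queue is full."""
        if len(self.inflight) >= self.depth:
            return False
        self.inflight.append((offset, length, callback))
        return True

    def poll(self, max_completions=0):
        """Complete queued requests (0 = no limit) and return how many finished."""
        done = 0
        while self.inflight and (max_completions == 0 or done < max_completions):
            offset, length, callback = self.inflight.popleft()
            callback(self.device[offset:offset + length])
            done += 1
        return done


def demo():
    device = b"".join(bytes([i]) * 512 for i in range(8))  # 8 blocks of 512 bytes
    queue = ToyQueuePair(device)
    results = []
    for block in (3, 1, 2):
        queue.submit_read(block * 512, 512, results.append)  # step 5: submit
    while len(results) < 3:
        queue.poll()                                          # step 6: poll
    return results
```

Nothing here touches hardware; the point is that no thread ever sleeps waiting for an interrupt, which is why each queue can stay non-thread-safe and lock-free.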


To squeeze everything from SPDK 


You should: 
+ Forget about filesystem 
SPDK gives you a raw block device. 


You don't have filenames, nested paths, etc.
But you also don't pay for tons of legacy synchronous FS code.


+ Forget about page-caching


Everything is designed for O_DIRECT, so you don't pay for kswapd.
Need a cache? Write one.


+ Forget about addressing bytes, and focus on pages 


uint32_t spdk_bdev_get_data_block_size(const struct spdk_bdev *bdev)
    Get block device data block size.
uint32_t spdk_bdev_get_physical_block_size(const struct spdk_bdev *bdev)
    Get block device physical block size.
size_t spdk_bdev_get_buf_align(const struct spdk_bdev *bdev)
    Get minimum I/O buffer address alignment for a bdev.
uint32_t spdk_bdev_get_optimal_io_boundary(const struct spdk_bdev *bdev)
    Get optimal I/O boundary for a bdev.
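

Thinking in pages instead of bytes mostly comes down to rounding. The sketch below, in Python, assumes the block size has already been obtained elsewhere (with SPDK it would come from a call such as spdk_bdev_get_data_block_size) and uses hypothetical helper names. It widens a byte range to whole blocks and allocates a page-aligned buffer through an anonymous mmap. Pinning the buffer, which SPDK also requires, is not shown.

```python
import ctypes
import mmap


def aligned_span(offset, length, block_size):
    """Return (start, size) of the smallest block-aligned range covering the bytes."""
    start = (offset // block_size) * block_size          # round down
    end = -(-(offset + length) // block_size) * block_size  # round up
    return start, end - start


def page_aligned_buffer(size):
    """Anonymous mappings begin on a page boundary, so the buffer is page-aligned."""
    return mmap.mmap(-1, size)


def address_of(buffer):
    """Memory address of the first byte of a writable buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buffer))
```

For example, a 100-byte value at byte offset 5000 on a device with 4096-byte blocks becomes one 4096-byte read starting at offset 4096, and the caller slices the value out of the buffer afterwards.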


Let's benchmark


On bare metal, no RAID


AMD Threadripper PRO 3995WX
128 threads @ 2.7 GHz

8x Samsung M393AAG40M32-CAE
1 TB RAM @ 3.2 GHz, 204 GB/s

8x Samsung PM1733 U.2
64 TB NVMe @ 48 GB/s

4x Nvidia RTX 3090


On 1 SSD:

+ UNIX IO: 50.7k IOPS
+ POSIX AIO: 573k IOPS
+ io_uring: 869k IOPS
+ SPDK: 1.2M IOPS

On 8 SSDs with 24 threads:

+ io_uring: over 8M IOPS
+ SPDK: over 9M IOPS


No native SPDK support in fio, only through xNVME. 
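

To put the single-SSD numbers side by side, here is a small arithmetic sketch in Python. The figures are the ones quoted above; the function names are made up for the example. The second helper gives the average interval between completed requests, which is not the latency of an individual request when many are in flight.

```python
# Single-SSD results quoted above, in IO operations per second.
IOPS = {
    "UNIX IO": 50_700,
    "POSIX AIO": 573_000,
    "io_uring": 869_000,
    "SPDK": 1_200_000,
}


def speedups(iops, baseline="UNIX IO"):
    """Throughput of every interface relative to the baseline."""
    base = iops[baseline]
    return {name: value / base for name, value in iops.items()}


def microseconds_between_completions(iops):
    """Average time between two completed requests: 1e6 / IOPS."""
    return {name: 1e6 / value for name, value in iops.items()}
```

Relative to plain UNIX IO, this works out to roughly 11x for POSIX AIO, 17x for io_uring, and 24x for SPDK.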


Real World Performance 
From Synthetic IO to KVS and DBMS 


Engine    Random Batch Writes            Random Batch Reads
RocksDB   57,000 ~ 200 MB/s              650,000 ~ 2.6 GB/s
UDisk     320,000 ~ 1.3 GB/s ~ 5.8x      4,200,000 ~ 16.8 GB/s ~ 6.5x


On 10 TB collections, with 1 TB of RAM, 8x SSDs and 32 cores
With 8 byte keys and misaligned direct accesses


unum.cloud/ucsb 


UKV: The BLAS of CRUD 


Open Binary Interface Standard 


Next: JavaScript


Started in Summer 2022 


github.com/unum-cloud/ukv 


UKV Backends 


Hidden Complexity 


[Diagram: Modalities, Distributions, and Backends. Backends shown: UMem ACID Store, RocksDB Persistent ACID Store, LevelDB Persistent Store, UDisk Persistent ACID Store]


github.com/unum-cloud/ukv 
UKV C Standard


Supports strides, like BLAS


ukv_key_t key { 42 };
ukv_bytes_cptr_t value { "meaning of life" };
ukv_write_t write {
    .db = db,
    .keys = &key,
    .values = &value,
    .error = &error,
};
ukv_write(&write);


ukv_key_t keys[2] = { 42, 43 };
ukv_bytes_cptr_t values[2] { "meaning of life", "is unknown" };
ukv_write_t write {
    .db = db,
    .tasks_count = 2,
    .keys = keys,
    .keys_stride = sizeof(ukv_key_t),
    .values = values,
    .values_stride = sizeof(ukv_bytes_cptr_t),
    .error = &error,
};
ukv_write(&write);


github.com/unum-cloud/ukv/include/ukv/blobs.h 
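

What a stride buys can be shown without the library. The sketch below is plain Python over raw bytes, not the UKV API: gather_keys and the record layout are hypothetical, and 64-bit signed keys are an assumption. One routine reads keys from a packed array, from an array of larger records, or from a single value repeated for every task, and only the stride changes.

```python
import struct

KEY = struct.Struct("<q")      # assumed: keys are 64-bit signed integers
RECORD = struct.Struct("<qd")  # hypothetical record: a key followed by a double


def gather_keys(buffer, count, stride, offset=0):
    """Read `count` keys, advancing `stride` bytes between consecutive keys."""
    return [KEY.unpack_from(buffer, offset + i * stride)[0] for i in range(count)]


# Layout 1: a packed array of keys, so stride = sizeof(key).
packed = struct.pack("<3q", 42, 43, 44)

# Layout 2: keys embedded in larger records, so stride = sizeof(record).
records = b"".join(RECORD.pack(k, float(k)) for k in (42, 43, 44))

# Layout 3: one key shared by every task, so stride = 0.
single = KEY.pack(42)
```

Calling gather_keys(packed, 3, KEY.size) and gather_keys(records, 3, RECORD.size) both yield the keys 42, 43, 44 with no repacking of the caller's data, while gather_keys(single, 3, 0) repeats 42 three times.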




UKV Frontends 


Performance is Accessible 


InfluxDB, Ray


github.com/unum-cloud/ukv 


UKV Python SDK 


Performance is Accessible 


main_collection[42] = binary_string
main_collection.set(42, binary_string)

42 in main_collection
main_collection.has_key(42)

main_collection[42]
main_collection.get(42)

del main_collection[42]
main_collection.pop(42)

main_collection[[42, 43, 44]]
main_collection[(42, 43, 44)]

import pyarrow as pa

keys = pa.array([1000, 2000], type=pa.int64())
strings: pa.StringArray = pa.array(['some', 'text'])
main_collection[keys] = strings

rows_batch = main_collection.sample(1_000)
values_batch = main_collection.docs.table[['name', 'age']].loc[rows_batch]


Integrations: Arrow, pandas, NetworkX


github.com/unum-cloud/ukv 
Give it a try


And join the development!


unum-cloud/ukv
pip install ukv
t.me/cpparm


Linux, GCC, C++, Python: Today 
MSVC, AppleClang, GoLang, Java: Soon 


@ashvardanian 


Check out Unum.Cloud 